efg | 2018-09-03

Setup

NOTE: load tidyverse after MASS below so that dplyr::select masks MASS::select, rather than the other way around.

library(MASS)          # fgl data
library(tidyverse)     # place after MASS to avoid select conflict
library(caret)         # preProcess, predict
library(rgl)           # par3d, plot3d, movie3d, rglwidget
library(RColorBrewer)  # brewer.pal
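If the load order cannot be controlled, calling the function with an explicit namespace is one way to sidestep the clash. A minimal sketch, assuming the dplyr package is installed; glassPredictors is an illustrative name that is not used elsewhere in this notebook.

```r
library(MASS)   # provides the fgl data and MASS::select

# Explicit namespace: resolves to dplyr's select no matter which
# package was attached last.
glassPredictors <- dplyr::select(fgl, -type)
dim(glassPredictors)   # 214 rows, 9 numeric columns
```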

magick from ImageMagick must be installed to create the animated GIF of the PCA.
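Whether the magick executable is visible to R can be checked before running the rest of the notebook. This is only a sketch using base R; it assumes the executable is named magick and is on the PATH, which may differ between ImageMagick installations.

```r
# Sys.which() returns an empty string when the executable is not on the PATH.
magickPath <- Sys.which("magick")

if (nzchar(magickPath)) {
  message("ImageMagick found at: ", magickPath)
} else {
  message("magick not found on PATH; the animated GIF step is likely to fail.")
}
```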

Forensic Glass Data

typeColorIndex <- as.integer(fgl$type)
rawData <- fgl %>% select(-type)

Principal Component Analysis

Let’s compute the values for the first 4 principal components using caret’s pca pre-processing.

“pca” requires “center” and “scale”.

These four PCs account for nearly 80% of variance.

nPCAcomponents <- 4
transformSetup <- preProcess(rawData, method=c("center", "scale", "pca"), pcaComp=nPCAcomponents)
pcaScores <- predict(transformSetup, rawData)
pcaScores %>% head(10)
            PC1        PC2        PC3         PC4
1  -1.148446843 -0.5282491  0.3712253 -1.72485681
2   0.572794160 -0.7580105  0.5554059 -0.75845396
3   0.937960515 -0.9276609  0.5536094 -0.20577184
4   0.141750924 -0.9594279  0.1168507 -0.41475157
5   0.350271021 -1.0886966  0.4839440 -0.06894065
6   0.289587596 -1.3209105 -0.8666466  0.92562711
7   0.252080399 -1.1135387  0.5393732 -0.08014021
8   0.120018063 -1.2189881  0.6232809  0.11575068
9   0.020767338 -0.3211795  0.1088046 -1.36846360
10  0.002346727 -1.0633203 -0.1205723  0.37514145
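The variance figures quoted in this notebook can be checked independently of caret. The sketch below uses base R's prcomp on the same nine predictors; pca, varExplained and cumVarExplained are names introduced here for illustration.

```r
library(MASS)   # fgl data

# Base-R PCA on the nine numeric predictors, centered and scaled.
pca <- prcomp(fgl[, names(fgl) != "type"], center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each PC, and the running total.
varExplained    <- pca$sdev^2 / sum(pca$sdev^2)
cumVarExplained <- cumsum(varExplained)

# Element 3 and element 4 are the cumulative shares for 3 and 4 PCs.
round(cumVarExplained, 3)
```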

You can verify preProcess gives the same PCAscores as in the SVD notebook.

Each PC is a weighted linear combination of all variables. PCs are orthogonal.
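Both properties can be checked numerically. The sketch below again uses base R's prcomp rather than caret; X, loadings, orthoError and scoreError are illustrative names.

```r
library(MASS)   # fgl data

# Center and scale the nine predictors.
X <- scale(as.matrix(fgl[, names(fgl) != "type"]))
pca <- prcomp(X)

# Each column of the rotation matrix holds the weights for one PC.
loadings <- pca$rotation

# Orthogonality: t(loadings) %*% loadings should be the identity matrix.
orthoError <- max(abs(crossprod(loadings) - diag(ncol(loadings))))

# Linear combination: the scores are the scaled data times the weights.
scoreError <- max(abs(X %*% loadings - pca$x))

c(orthoError = orthoError, scoreError = scoreError)   # both should be near zero
```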

PCs can be used as variables in other machine learning algorithms.

Machine learning algorithms that use PCs as predictors are limited by how much of the variance in the original variables the chosen number of PCs explains.
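As one illustration of PCs serving as predictors, the sketch below fits a linear discriminant analysis to the first four PCs. LDA is chosen here only as a convenient example from the MASS package; it is not a method used in this notebook, and the accuracy is computed on the training data, so it is optimistic.

```r
library(MASS)   # fgl data and lda()

pca <- prcomp(fgl[, names(fgl) != "type"], center = TRUE, scale. = TRUE)

# Keep the first four PC scores and attach the glass type as the response.
pcData <- data.frame(pca$x[, 1:4], type = fgl$type)

# Linear discriminant analysis with the four PCs as the only predictors.
ldaFit <- lda(type ~ PC1 + PC2 + PC3 + PC4, data = pcData)

# Resubstitution accuracy: predictions on the same rows used for fitting.
predicted <- predict(ldaFit, pcData)$class
accuracy  <- mean(predicted == pcData$type)
accuracy
```

Whatever the classifier, information in the discarded PCs is unavailable to it, which is the limitation described above.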

Interactive 3D scatterplot of first 3 principal components

Project 9-dimensional data onto 3 dimensions for display.

The first 3 PCs account for about 66% of variance in data.

Other sets of 3 PCs, such as PC2, PC3 and PC4, could be displayed in 3D space instead.

typeColors <- brewer.pal(length(levels(fgl$type)), "Dark2")   
par3d("windowRect"=c(50,50,800,800))
plot3d(x=pcaScores$PC1, y=pcaScores$PC2, z=pcaScores$PC3,
       col=typeColors[typeColorIndex],
       xlab="PC1", ylab="PC2", zlab="PC3", type="s", size=3)
rglwidget(elementId="FGL1")

The Chrome browser works best for displaying the figure above.

Drag mouse over figure to rotate. Use mouse wheel to zoom in and out.
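A different triple of PCs, such as PC2, PC3 and PC4, can be shown the same way. The sketch below wraps the plot3d call in a hypothetical helper, plotPCs, which is not part of rgl or of this notebook; the scores come from base R's prcomp, so axis signs are not guaranteed to match the caret output.

```r
library(MASS)           # fgl data
library(rgl)            # plot3d
library(RColorBrewer)   # brewer.pal

pca <- prcomp(fgl[, names(fgl) != "type"], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)
typeColors <- brewer.pal(nlevels(fgl$type), "Dark2")

# Hypothetical helper: plot any three PCs, named by column,
# with one color per observation.
plotPCs <- function(scores, pcs, pointColors) {
  stopifnot(length(pcs) == 3, all(pcs %in% names(scores)))
  plot3d(x = scores[[pcs[1]]], y = scores[[pcs[2]]], z = scores[[pcs[3]]],
         col = pointColors,
         xlab = pcs[1], ylab = pcs[2], zlab = pcs[3],
         type = "s", size = 3)
}

plotPCs(scores, c("PC2", "PC3", "PC4"), typeColors[as.integer(fgl$type)])
```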

Legend

x <- barplot(rep(1,6), yaxt="n", col=typeColors)
text(x, 0.5, levels(fgl$type))

Automatically rotate for about 15 seconds when created.

Note that the “Head” (headlamp) instances form a fairly good cluster, but the other types do not.

DO NOT close the separate RGL device window, or the following call will raise an error.

play3d(spin3d(), duration=15)

Animated GIF

Create the animated GIF using magick from ImageMagick – this takes some time. Display below using HTML.

150 PNG images will be rendered: 15 seconds duration * 10 frames/second (the movie3d default for fps).

movie3d(spin3d(), duration = 15, dir = getwd(),
        movie="ForensicGlass-PCA",
        verbose=FALSE, convert="magick -delay 1x%d %s*.png %s.%s")

Here’s the HTML needed in the R Markdown document to embed the GIF into the HTML file created with knitr.

<div id="PCA">
  <img src="ForensicGlass-PCA.gif" alt="">
</div>

Processing time: 68.6 sec

2018-09-03 23:08:06
